COLABA: Arabic Dialect Annotation and Processing

Authors

  • Mona Diab
  • Nizar Habash
  • Owen Rambow
  • Mohamed Altantawy
  • Yassine Benajiba
Abstract

In this paper, we describe COLABA, a large effort to create resources and processing tools for Dialectal Arabic blogs. We describe the objectives of the project, the process flow, and the interaction between the different components. We briefly describe the manual annotation effort and the resources created. Finally, we sketch how these resources and tools are put together to create DIRA, a term-expansion tool for information retrieval over dialectal Arabic collections using Modern Standard Arabic queries.
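The abstract does not say how DIRA performs term expansion, so the following is only a minimal sketch of the general idea, assuming a plain lexicon lookup from Modern Standard Arabic (MSA) terms to dialectal variants. The names `TOY_LEXICON` and `expand_query` and the two lexicon entries are hypothetical illustrations, not part of COLABA or DIRA.

```python
from typing import Dict, List

# Hypothetical toy lexicon (an assumption for illustration only):
# each MSA term maps to a few commonly cited dialectal equivalents.
TOY_LEXICON: Dict[str, List[str]] = {
    "سيارة": ["عربية"],          # "car": MSA -> Egyptian
    "الآن": ["دلوقتي", "هلق"],   # "now": MSA -> Egyptian, Levantine
}


def expand_query(query: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Return the query terms plus any dialectal variants found in the lexicon.

    Terms are split on whitespace; order is preserved and duplicates dropped.
    """
    expanded: List[str] = []
    for term in query.split():
        # Keep the original MSA term, then append its dialectal variants.
        for candidate in [term, *lexicon.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


if __name__ == "__main__":
    for term in expand_query("سيارة الآن", TOY_LEXICON):
        print(term)
```

The expanded term list could then be passed to any retrieval engine as a disjunctive query; a realistic system would presumably also need morphological analysis, which this sketch leaves out.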

Similar resources

Borrowing the Verb “ast” and Its Varieties in Arabic Dialect of Sarab

“Borrowing” is a linguistic process studied in diachronic linguistics, in which one language borrows elements from another. It usually occurs in areas where two languages are in contact. Such borrowing happens in a dialect spoken in South Khorasan. The Arabs living in this part of Iran probably immigrated in the early centuries of Islam. In thi...

The Status of [h] and [ʔ] in the Sistani Dialect of Miyankangi

The purpose of this article is to determine the phonemic status of [h] and [ʔ] in the Sistani dialect of Miyankangi. Auditory tests applied to the relevant data show that [ʔ] occurs mainly in word-initial position, where it stands in free variation with Ø. The only place where [h] is heard is in Arabic and Persian loanwords, and only in the pronunciation of some speakers who are educated and/or...

Arabic Dialect Identification

The written form of the Arabic language, Modern Standard Arabic (MSA), differs in a nontrivial manner from the various spoken regional dialects of Arabic – the true “native languages” of Arabic speakers. Those dialects, in turn, differ quite a bit from each other. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. In this article, we des...

Developing and Using a Pilot Dialectal Arabic Treebank

In this paper, we describe the methodological procedures and issues that emerged from the development of a pilot Levantine Arabic Treebank (LATB) at the Linguistic Data Consortium (LDC) and its use at the Johns Hopkins University (JHU) Center for Language and Speech Processing workshop on Parsing Arabic Dialects (PAD). This pilot, consisting of morphological and syntactic annotation of approxim...

Saudi Twitter Corpus for Sentiment Analysis

Sentiment analysis (SA) has received growing attention in Arabic language research. However, few studies have directly applied SA to Arabic, owing to the lack of a publicly available dataset for this language. This paper partially bridges this gap by focusing on one Arabic dialect, the Saudi dialect. This paper presents an annotated data set of 4700 for Saudi dialect sentiment a...

Publication date: 2011